A Grammar Based Analysis of Column Header Categories for Web Tables

Authors

  • Mukkai Krishnamoorthy
  • Sharad Seth
  • Ramana Jandhyala
  • George Nagy
Abstract

As part of a project to harvest semi-structured data from web tables, we describe an approach to extract an abstract representation of the column-header categories based on a context-free grammar for linear strings. The column-header structure is generally an XY-tessellation. The grammar provides a compact representation of infinitely many structural variations possible within column headers. Before parsing, the 2D column-header structure is converted to a linear string of its atomic cell labels and delimiters for the X and Y cuts. The acceptable strings represent a superset of admissible column-header structures from which the invalid ones are eliminated by performing geometric and lexical checks on the labels of the parse tree. Experimental results on web tables show that 80% of the headers in the sample could be processed successfully using the grammatical approach.
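The linearize-then-parse idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' actual grammar or notation: the delimiters `|` (X cut, splitting columns) and `/` (Y cut, splitting rows), the parenthesized string format, the toy grammar `H -> label | ( H (| H)+ ) | ( H (/ H)+ )`, and the function names `linearize` and `parse`. Labels are assumed to be unique per cell and free of whitespace and delimiter characters. The geometric and lexical checks applied to the parse tree are not shown.

```python
import re

DELIMS = set("()|/")

def linearize(grid, r0=0, r1=None, c0=0, c1=None):
    """Turn a header region into a string by recursive XY cuts.

    grid[r][c] is the label of the cell covering position (r, c);
    a spanning cell repeats its label in every position it covers.
    """
    if r1 is None:
        r1 = len(grid)
    if c1 is None:
        c1 = len(grid[0])
    labels = {grid[r][c] for r in range(r0, r1) for c in range(c0, c1)}
    if len(labels) == 1:              # atomic cell
        return labels.pop()
    # X cuts: column boundaries that no cell in the region crosses.
    xcuts = [c for c in range(c0 + 1, c1)
             if all(grid[r][c - 1] != grid[r][c] for r in range(r0, r1))]
    if xcuts:
        b = [c0] + xcuts + [c1]
        parts = [linearize(grid, r0, r1, lo, hi) for lo, hi in zip(b, b[1:])]
        return "(" + " | ".join(parts) + ")"
    # Y cuts: row boundaries that no cell in the region crosses.
    ycuts = [r for r in range(r0 + 1, r1)
             if all(grid[r - 1][c] != grid[r][c] for c in range(c0, c1))]
    if ycuts:
        b = [r0] + ycuts + [r1]
        parts = [linearize(grid, lo, hi, c0, c1) for lo, hi in zip(b, b[1:])]
        return "(" + " / ".join(parts) + ")"
    raise ValueError("region is not an XY-tessellation")

def parse(s):
    """Recursive-descent parser for the toy grammar; returns a nested tree."""
    tokens = re.findall(r"[()|/]|[^()|/\s]+", s)
    pos = 0

    def header():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of string")
        tok = tokens[pos]
        pos += 1
        if tok not in DELIMS:         # H -> label
            return tok
        if tok != "(":
            raise SyntaxError(f"unexpected {tok!r}")
        children = [header()]
        cut = tokens[pos] if pos < len(tokens) else None
        if cut not in ("|", "/"):
            raise SyntaxError("expected a cut delimiter")
        while pos < len(tokens) and tokens[pos] == cut:
            pos += 1
            children.append(header())
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        pos += 1
        return (cut, children)

    tree = header()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree

# "Year" spans two rows; "Sales" spans two columns above "Q1" and "Q2".
example = [["Year", "Sales", "Sales"],
           ["Year", "Q1",    "Q2"]]
print(linearize(example))             # (Year | (Sales / (Q1 | Q2)))
print(parse(linearize(example)))
```

In this sketch the grammar accepts any well-nested string, including ones that correspond to no valid header, which mirrors the abstract's point that the accepted strings form a superset that later checks must prune.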

Similar resources

Clustering header categories extracted from web tables

Revealing related content among heterogeneous web tables is part of our long term objective of formulating queries over multiple sources of information. Two hundred HTML tables from institutional web sites are segmented and each table cell is classified according to the fundamental indexing property of row and column headers. The categories that correspond to the multi-dimensional data cube vie...

Recovering Semantics of Tables on the Web

The Web offers a corpus of over 100 million tables [6], but the meaning of each table is rarely explicit from the table itself. Header rows exist in few cases and even when they do, the attribute names are typically useless. We describe a system that attempts to recover the semantics of tables by enriching the table with additional annotations. Our annotations facilitate operations such as sear...

Development of a site-specific regression model for assessment of road-header cutting performance of Tabas coal mine based on rock properties

In underground excavation, where the road-headers are employed, a precise prediction of the road-header performance has a vital role in the economy of the project. In this paper, a new model is developed for prediction of the road-header performance using the non-linear multivariate regression analysis. This model is able to estimate the instantaneous cutting rate (ICR) of roadheader based on r...

A Generic Tool to Generate a Lexicon for NLP from Lexicon-Grammar Tables (author manuscript, published in Actes du 27e Colloque International sur le Lexique et la Grammaire)

Symbolic approaches to deep parsing often require large-coverage and fine-grained lexical information, such as a syntactic lexicon. Lexicon-Grammar tables (Gross 1975, 1994), carefully developed by linguists since the 1970s, constitute such a syntactic resource. Each table represents a class of predicates sharing some syntactic features. Each row corresponds to a lexical entry (verb, predicative n...

A Tiled-Table Convention for Compressing FITS Binary Tables

This document describes a convention for compressing FITS binary tables that is modeled after the FITS tiled-image compression method (White et al. 2009) that has been in use for about a decade. The input table is first optionally subdivided into tiles, each containing an equal number of rows, then every column of data within each tile is compressed and stored as a variable-length array of byte...

Publication date: 2010